
Problem Presentation

Let's read the numbers in Italy:

Cancer numbers in Italy 2021 confirms that breast cancer is the most frequently diagnosed neoplasm in women: about one in three female malignancies (30%) is a breast cancer.

According to the same report, Cancer numbers in Italy 2021, an estimated 55,000 new diagnoses of female breast cancer occurred in Italy in 2020, with 12,500 deaths estimated for 2021. Net survival five years after diagnosis is 88%.

According to ISTAT data, in 2018 breast cancer was, with 13,076 deaths, the leading cause of cancer death in women.

Since the end of the nineties there has been a steady decrease in mortality from breast cancer (-0.8% per year), attributable to the wider diffusion of early-diagnosis programs, and therefore to earlier detection, as well as to therapeutic progress.

 

Breast Cancer History

Humans have known about breast cancer for a long time. For example, the Edwin Smith surgical papyrus, a medical text dating back to 3000-2500 BC, describes cases of breast cancer. Hippocrates described the stages of breast cancer around 400 BC. In the first century AD, doctors experimented with surgical incisions to destroy tumors.

Our modern approach to breast cancer treatment and research began to take shape in the 19th century. We consider these milestones:

  • 1882: William Halsted performs the first radical mastectomy. This surgery will remain the standard operation for breast cancer treatment well into the 20th century.

  • 1895: The first X-ray is taken. Eventually, low-dose X-rays called mammograms will be used to detect breast cancer.

  • 1937: In addition to surgery, radiotherapy is used to spare the breast. After removing the tumor, needles containing radium are inserted into the breast and near the lymph nodes.

  • 2013: The four major subtypes of breast cancer are defined as HR+/HER2- ("luminal A"), HR-/HER2- ("triple negative"), HR+/HER2+ ("luminal B") and HR-/HER2+ ("HER2-enriched").

  • 2018: A clinical study suggests post-surgery chemotherapy does not benefit 70% of women with early-stage breast cancer.

Breast cancer treatment is becoming more personalized as doctors learn more about the disease. It is now seen as a disease with subtypes that have different patterns and ways of acting on the body. The ability to isolate specific genes and classify breast cancer is the start of more personalized treatment options. Special tests can also tell doctors more about breast cancer.

Why this topic?

 

In today's world, especially among young university colleagues, reaching a model performance close to 99% has become a challenge and often the only goal. Obviously, putting an excellent model with good predictive skills into production is the goal of any data scientist, but too often we forget that data science is only the means and not the end. If we really want this science to accompany us in everyday life, we must begin to see it as a tool for solving the problems placed before us, and we must understand that the study and analysis of those problems matters far more than the technique used to solve them.

For this reason I have chosen a problem from the healthcare sector, the sector that best shows how useful these techniques can be in daily life and, above all, that improving the predictive capacity of a model can, in this case, mean saving a human life.

Data Science is not R. Data Science is not Python. Data Science is not SQL. Data Science is not Spark. Data Science is not TensorFlow.

Data Science is using the above tools and techniques, and if required inventing new ones, to solve a problem using "data" in a "scientific" way.

 

Dataset

The key challenge of this study is the classification of breast tumors as malignant (cancerous) or benign (non-cancerous) on the Breast Cancer Wisconsin (Diagnostic) Dataset, which contains only quantitative features. These are the features we have available: id is the identification number of the patient, which we drop because it is not useful for the analysis, and the target variable we want to predict is diagnosis, which refers to the state of the tumor. We treat the target as a two-level factor: 1 if diagnosis is M (malignant) and 0 if it is B (benign).

Rows: 569
Columns: 32
$ id                      <int> 842302, 842517, 84300903, 84348301, 84358402, ~
$ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "~
$ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450~
$ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9~
$ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, ~
$ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, ~
$ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0~
$ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0~
$ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0~
$ concave.points_mean     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0~
$ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087~
$ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0~
$ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345~
$ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902~
$ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18~
$ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.~
$ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114~
$ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246~
$ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0~
$ concave.points_se       <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188~
$ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0~
$ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051~
$ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8~
$ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6~
$ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,~
$ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, ~
$ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791~
$ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249~
$ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0~
$ concave.points_worst    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0~
$ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985~
$ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0~
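As a minimal sketch of this preprocessing (in Python with pandas here, although the analysis itself is carried out in R; the values are toy rows mimicking the glimpse above):

```python
import pandas as pd

# Toy rows with the same first columns as the glimpse above (values illustrative)
df = pd.DataFrame({
    "id": [842302, 842517, 84300903],
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 19.69],
})

df = df.drop(columns="id")  # the patient id carries no predictive information
# M (malignant) -> 1, B (benign) -> 0
df["diagnosis"] = (df["diagnosis"] == "M").astype(int)
```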

 

Features Extraction

The diagnosis of breast tumors has traditionally been performed by a full biopsy, an invasive surgical procedure. Fine needle aspiration (FNA) is a type of biopsy procedure in which a thin needle is inserted into an area of abnormal-appearing tissue or body fluid. The aspirated material is then expressed onto a glass slide and stained. The image for digital analysis is generated by a color video camera mounted atop an Olympus microscope: the slide is projected into the camera and the image is captured.

The first step in successfully analyzing the digital image is to specify an accurate location of each cell nucleus boundary. The actual boundary of the cell nucleus is located by an active contour model known in the literature as a Snake. This is a feature extraction technique which initially identifies the approximate Region Of Interest (ROI), by removing the non-tumor part, and then minimizes an energy function defined over the arclength of a closed curve. The energy function is defined in such a way that the minimum value occurs when the curve accurately corresponds to the boundary of a cell nucleus.

Here, we can see a practical example of lesion segmentation on a breast MRI (Magnetic Resonance Imaging) scan:

Lesion segmentation on a breast MRI scan

 

    1. Locate a rectangular ROI box containing a post-contrast breast MRI lesion.
    2. Initial segmentation by the FCMs-based method.
    3. Deformation of the GVF snake using the FCMs-based contour for initialization.
    4. Radiologists' manual delineation.

In the computerized segmentation section, FCMs clustering based method is used to produce an initial segmentation of the input image, while the gradient vector flow (GVF) snake model is applied to the initial segmentation to obtain the final segmentation. The initial segmentation method is referred to as the FCMs-based and the final segmentation method is referred to as the GVF-FCMs for short. The segmentation performance of both methods is evaluated with manual segmentation by experienced radiologists on dynamic contrast-enhanced (DCE) MRI.

This approach is very powerful, with a low average time cost and dynamic memory cost.

A set of 569 images has been processed in the manner described above, yielding a dataset of 569 observations.

 

Features Information

The computer vision diagnostic system extracts ten different features from the snake-generated cell nuclei boundaries. All of the features are numerically modeled such that larger values will typically indicate a higher likelihood of malignancy.

The extracted features are as follows (the first item is the class label, followed by the ten measured features).

    1. Diagnosis (M = malignant, B = benign).
    2. Radius: the radius of an individual nucleus is measured by averaging the length of the radial line segments defined by the centroid of the snake and the individual snake points.
    3. Texture: the texture of the cell nucleus is measured by finding the variance of the grayscale intensities in the component pixels.
    4. Perimeter: the total distance between the snake points constitutes the nuclear perimeter.
    5. Area: nuclear area is measured by counting the number of pixels on the interior of the snake and adding one-half of the pixels in the perimeter.
    6. Smoothness: the smoothness of a nuclear contour is quantified by measuring the difference between the length of a radial line and the average length of the lines surrounding it.
Radial Lines Used for Smoothness

 

    7. Compactness: perimeter and area are combined to give a measure of the compactness of the cell nucleus using the formula \(\frac{perimeter^2}{area}\).
    8. Concavity: in a further attempt to capture shape information we measure the number and severity of concavities or indentations in a cell nucleus. We draw chords between non-adjacent snake points and measure the extent to which the actual boundary of the nucleus lies on the inside of each chord.
Chord Used to Compute Concavity

 

    9. Concave Points: this feature is similar to Concavity but measures only the number, rather than the magnitude, of contour concavities.
    10. Symmetry: in order to measure symmetry, the major axis, or longest chord through the center, is found. We then measure the length difference between lines perpendicular to the major axis to the cell boundary in both directions.
Segments Used in Symmetry Computation

 

    11. Fractal Dimension: it is approximated using the coastline approximation. The perimeter of the nucleus is measured using increasingly larger 'rulers'. As the ruler size increases, decreasing the precision of the measurement, the observed perimeter decreases.
Sequence of Measurements for Computing Fractal Dimension

 

As we can see above, the number of columns grows because three statistics are computed for each of the 10 base measurements: the mean, the standard error (se), and the "worst" value (the mean of the three largest values). The final dataset therefore contains 32 columns: the id, the diagnosis, and 30 features.
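The compactness measure described above has a simple closed form; as a sketch (Python here, while the report's analysis is in R), note that for a circle of radius \(r\) it equals \((2\pi r)^2 / (\pi r^2) = 4\pi\), its minimum, so larger values indicate more irregular nuclei. The dataset's stored compactness values are on a different, normalized scale, so this only illustrates the shape of the formula:

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness = perimeter^2 / area, minimal (4*pi) for a perfect circle."""
    return perimeter ** 2 / area

# A unit circle: perimeter 2*pi, area pi -> compactness exactly 4*pi
circle = compactness(2 * math.pi, math.pi)

# First observation above: perimeter_mean = 122.80, area_mean = 1001.0
first_nucleus = compactness(122.80, 1001.0)
```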

 

Exploratory Data Analysis

In the first phase of the exploratory analysis we want to count how many patients fall in each reference category. In this case the dataset does not reflect a typical medical screening distribution: usually we would have a large number of patients with a benign tumor and a small number with a malignant one.

 

We can see that the percentage of malignant cancers in this dataset is higher than what would be expected in a classic binary classification example in the medical field. Such applications usually involve Imbalanced Classification; here the two classes are certainly not in equilibrium, but with 37% malignant cancers we can conclude that we are facing only a mildly imbalanced classification problem.
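This class balance is easy to verify from the counts (a Python sketch; 357 benign vs. 212 malignant is the split of this dataset):

```python
from collections import Counter

# The dataset contains 357 benign and 212 malignant observations (569 total)
labels = ["B"] * 357 + ["M"] * 212

counts = Counter(labels)
share_malignant = counts["M"] / len(labels)  # ~0.37, a mild imbalance
```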

 

We analyze the features and try to understand which ones have greater predictive value by considering, for each feature, the degree of separation between the two classes. No single feature separates the classes perfectly, but several are quite good, such as Concave Points Worst, Concavity Worst, Perimeter Worst, Radius Worst, Concavity Mean, Concave Points Mean, Area Mean, and Perimeter Mean.

There are also features that have almost perfectly overlapping values for the two classes, such as Symmetry Se, Smoothness Se, Symmetry Mean and Texture Se.

Let's look at these features with a scatterplot conditioned on the status of the cancer.

 

As we can see, the separation is not clear for these features and this could mean that these variables do not have good predictive value.

 

 

Pearson Correlation

At this step, we compute the correlation matrix.

The correlation is measured considering the Pearson formula: \[ \rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \cdot \sigma_Y} \] Where:

  • \(Cov(X,Y)\) is the covariance between the two sets of values X and Y.
  • \(\sigma_X\) is the standard deviation of the set X.
  • \(\sigma_Y\) is the standard deviation of the set Y.
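The Pearson coefficient can be computed directly from sample moments; a small NumPy sketch with toy vectors (the report itself uses R for this step):

```python
import numpy as np

def pearson(x, y):
    """rho_XY = Cov(X, Y) / (sigma_X * sigma_Y), from sample moments."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

x = np.array([1.0, 2.0, 3.0, 4.0])
assert abs(pearson(x, 2 * x + 1) - 1.0) < 1e-12  # perfectly positively correlated
assert abs(pearson(x, -3 * x) + 1.0) < 1e-12     # perfectly negatively correlated
```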

The highest correlations are between:

  • Perimeter Mean and Radius Worst.

  • Area Worst and Radius Worst.

  • Perimeter Worst and Radius Worst, Perimeter Mean, Area Worst, Area Mean, Radius Mean.

  • Texture Mean and Texture Worst.

     

     

Now, we can see the plots for some of these highly correlated features.

Positively Correlated Features Some pairs of variables, such as Perimeter Mean - Radius Worst and Area Mean - Radius Worst, offer a good level of separation between the two classes. The same is not true for the other two pairs of variables.

 

 

Uncorrelated Features In this case, we can see that there is no separation at all between the two classes.

 

 

Negatively Correlated Features Here we can see some negative correlations, although not with very high absolute values. We observe that the pair Radius Mean - Fractal Dimension Mean offers an acceptable level of separation between the two classes.

 

 

Principal Component Analysis

We have too many features, which is a problem both for visualization and for the interpretability of the model. Principal component analysis (PCA) computes the principal components and uses them to perform a change of basis on the data, keeping only the first few components, which together account for a chosen share of the variance. In this project we use this tool only for visualizing the features; in particular, we want to show the two groups of patients in the transformed space of the first two principal components.

We have a vector \(x_i \in \mathbb{R^p}\), and the goal is to find a way to map data such that: \[ x_i \rightarrow z_i \in \mathbb{R^q}, \,\,\, with \,\,\, q < p \] making sure that:

  • some relevant information about the distribution of \(X\) is preserved.

  • it's possible to go back to the starting space \(z_i \rightarrow x_i\).

     

Goal:

  • transform the original variables \(X = (X_1, X_2, ..., X_p)'\), into \(Z = (Z_1, Z_2, ..., Z_p)'\).

  • the transformation is linear \[ Z_j = \phi_{1j}X_1 + \phi_{2j}X_2 + ... + \phi_{pj}X_p \] it is called \(j^{th}\) principal component. The coefficients: \[ \phi_{j} = (\phi_{1j}, \phi_{2j}, ..., \phi_{pj})' \] they are called loadings of the \(j^{th}\) principal component.

The principal components are:

  • sorted by decreasing variance: \(Var[Z_1] \ge Var[Z_2] \ge ... \ge Var[Z_p]\)

  • uncorrelated: \(Cov[Z_l, Z_m] = 0 \,\,\, \forall \,\,\, l \neq m\).

We also want the \(Z\) to preserve all the variance contained in \(X\).
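These properties (linear transformation, components sorted by decreasing variance, zero correlation, total variance preserved) can be verified with an eigendecomposition of the covariance matrix; a NumPy sketch on simulated data, not the report's R code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy features

Xc = X - X.mean(axis=0)                        # center each column
eigval, eigvec = np.linalg.eigh(np.cov(Xc.T))  # eigendecompose the covariance
order = np.argsort(eigval)[::-1]               # sort by decreasing variance
phi = eigvec[:, order]                         # loadings phi_j of each component
Z = Xc @ phi                                   # principal components Z = X * phi

var_Z = Z.var(axis=0, ddof=1)
# Var[Z_1] >= Var[Z_2] >= ..., the Z_j are uncorrelated,
# and the total variance of X is preserved
```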

 

Now, we want to analyze the behavior of the data transformed with the PCA in the new space. We are interested in understanding the relationship between the original variables and the principal components, and in choosing a number of components sufficient to accumulate a chosen share of the variance.

 

The biplot is a visualization that allows us to see the original units in the space of the principal components and to overlay the directions of the original \(X\) variables in the space of the new features \(Z\).

 

Looking at the share of variability that each component represents, we can say that the first component accounts for \(44.3\, \%\) of the variability and the second for \(19 \, \%\). The vectors show us in which direction the original variables move in the \(Z\) domain. When \(Z_1\) and \(Z_2\) decrease, we are in a situation where malignant cancer diagnoses increase.

In particular, we can identify two groups of variables:

  • the first is located in the upper left and is strongly correlated negatively with the first component and positively with the second component.

  • the second group of variables is negatively correlated with both the first and the second component.

This makes us think that within the same class of malignant tumors, there may be factors (latent concepts) that distinguish two cells even if both are carriers of the disease.

 

In order to choose the number of principal components, we can look at the scree plot, which represents the amount of variability explained by each principal component with respect to the total variability contained in the data.

 

We recover \(79.3 \%\) of the variance of the entire dataset using four principal components, so little information is lost.

 

Now, we want to analyze the contribution of the original variables with respect to the first four components.

 

As we can see, the variables that explain most of the information along the first principal component are Concave Points Mean, Concavity Mean, Concave Points Worst, Compactness Mean, and so on. This is consistent with what was said previously about the biplot: the variables most correlated with the component under examination are also those with the greatest contribution, i.e. those with the longest direction vectors.

The same conclusions can be made for the contributions of the variables to the other components.

 

 

Bayesian analysis

The main goal is a full Bayesian analysis aimed at understanding whether a tumor is malignant or benign. To evaluate the predictive capacity of the model, the dataset was split into a training set (80%) and a test set (20%). The createDataPartition function of the caret package was used to maintain the same balance of the reference classes in both sets.

From the correlation analysis we noticed many features that are highly correlated with each other, which can lead to the problem of multicollinearity, i.e. a model whose predictors are correlated with one another. To mitigate this, we consider all pairs of variables and drop features until no pair with correlation \(\rho_{XY} \ge 0.7\) remains. To do this we used the findCorrelation function of the caret package.
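As a rough illustration of the idea behind findCorrelation (this Python sketch is a simplified greedy stand-in, not caret's exact algorithm): for each pair above the cutoff, the member with the larger mean absolute correlation is dropped.

```python
import numpy as np

def drop_correlated(X, names, cutoff=0.7):
    """Greedy stand-in for caret::findCorrelation: for every pair with
    |rho| >= cutoff, drop the member with the larger mean absolute correlation."""
    corr = np.abs(np.corrcoef(X.T))
    mean_abs = corr.mean(axis=0)
    dropped = set()
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if i in dropped or j in dropped:
                continue
            if corr[i, j] >= cutoff:
                dropped.add(i if mean_abs[i] >= mean_abs[j] else j)
    return [n for k, n in enumerate(names) if k not in dropped]

rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.05, size=200)   # nearly a duplicate of a
c = rng.normal(size=200)                   # independent of both
kept = drop_correlated(np.column_stack([a, b, c]), ["a", "b", "c"])
# one of a/b is dropped, c is kept
```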

As we can see, at the end of this procedure, we pass from 30 predictors to 10 predictors:

Rows: 569
Columns: 11
$ diagnosis               <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9~
$ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, ~
$ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087~
$ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902~
$ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114~
$ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0~
$ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051~
$ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791~
$ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985~
$ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0~

 

In this problem, the target variable diagnosis can be modeled as a Bernoulli distribution: \[ Y_i \in \{0,1\} \, where \, \, 1 \, \, is \, \, malignant\, \, (cancerous) \, \, and \, \, 0\, \, is\, \, benign\, \, (non \,\, cancerous) \] We consider the following Logistic Regression model with the logit as link function, which is the mathematical expression that connects the value of the linear predictor with the response \(Y\). \[ Y_i \sim Bernoulli(p_i) \] \[ logit(p_i) = log\Big(\frac{p_i}{1-p_i}\Big) = \beta_1 \ Texture \ Mean \ + \beta_2 \ Area\ Mean \ + \beta_3 \ Symmetry\ Mean \ + \beta_4 \ Texture \ Se + \beta_5 \ Smoothness\ Se + \\ + \beta_6 \ Symmetry \ Se + \beta_7 \ Fractal \ Dimension \ Se + \beta_8 \ Smoothness \ Worst + \beta_9 \ Symmetry \ Worst + \beta_{10} \ Fractal \ Dimension \ Worst \]
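A quick numerical check of the logit link (a Python sketch): a linear predictor of 0 corresponds to even odds, and adding \(\log 3\) to it triples the odds.

```python
import math

def inv_logit(eta: float) -> float:
    """Invert logit(p) = log(p / (1 - p)): map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

assert abs(inv_logit(0.0) - 0.5) < 1e-12             # even odds -> p = 0.5
assert abs(inv_logit(math.log(3.0)) - 0.75) < 1e-12  # odds 3:1 -> p = 0.75
```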

 

One of the assumptions of logistic regression is that the predictors are not strongly correlated with each other, which is precisely why we removed the highly correlated features above.

Model coefficients \(\beta_i\) are typically given Normal priors because they can take any real value. \[ \beta_i \sim N\Big(\mu = 0, \sigma^2 = 10^6\Big) \] that is, a vague prior with precision \(\tau = 10^{-6}\) (the parameterization JAGS uses in dnorm). For the specification of the model, we do not consider the intercept.

 

As we can see from the following plot, the balance of the response variable is respected after the division into train set and test set, to avoid introducing bias.

 

Implementing RJags

We have implemented the model using RJags, as follows:

## Model 1 (Complete)
model <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <-  beta1*x1[i] + beta2*x2[i] +
      beta3*x3[i] + beta4*x4[i] + beta5*x5[i] +
      beta6*x6[i] + beta7*x7[i] + beta8*x8[i] +
      beta9*x9[i] + beta10*x10[i]}
  
  ## Defining the prior beta parameters
  beta1  ~ dnorm(0, 1.0E-6)
  beta2  ~ dnorm(0, 1.0E-6)
  beta3  ~ dnorm(0, 1.0E-6)
  beta4  ~ dnorm(0, 1.0E-6)
  beta5  ~ dnorm(0, 1.0E-6)
  beta6  ~ dnorm(0, 1.0E-6)
  beta7  ~ dnorm(0, 1.0E-6)
  beta8  ~ dnorm(0, 1.0E-6)
  beta9  ~ dnorm(0, 1.0E-6)
  beta10 ~ dnorm(0, 1.0E-6)
}
## Passing the data for RJags
data.jags <- list("y" = y, "N" = N, "x1" = x1, "x2" = x2,
                  "x3" = x3, "x4" = x4, "x5" = x5,
                  "x6" = x6, "x7" = x7, "x8" = x8,
                  "x9" = x9, "x10" = x10)

## Defining parameters of interest to show after running RJags
mod.params <- c("beta1", "beta2", "beta3", "beta4",
                "beta5", "beta6", "beta7", "beta8",
                "beta9", "beta10")
## Bayesian Approach
## Run JAGS

## Model 1
mod_jags <- jags(data = data.jags,
                 n.iter = 10000,
                 model.file = model,      
                 n.chains = 3, 
                 parameters.to.save = mod.params,
                 n.burnin = 1000, n.thin = 10) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 456
##    Unobserved stochastic nodes: 10
##    Total graph size: 10063
## 
## Initializing model
Running Jags with 3 chains, 10000 iterations, a burn-in of 1000 steps and a thinning of 10 we have:

 

Summary table of \(\beta\) coefficients
mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta1 -0.0737 0.0445 -0.1639 -0.1027 -0.0735 -0.0436 0.0103 1.0015 1800
beta10 -31.5038 17.0101 -64.8325 -42.7513 -31.8779 -20.1080 0.2389 1.0011 2700
beta2 0.0092 0.0011 0.0073 0.0085 0.0092 0.0099 0.0113 1.0025 970
beta3 -80.1357 10.7527 -101.4617 -87.2797 -80.0974 -72.7272 -60.4596 1.0036 630
beta4 0.4216 0.4274 -0.4200 0.1360 0.4209 0.7062 1.2403 1.0018 1400
beta5 41.1128 113.6346 -183.2997 -35.5522 40.2877 119.2611 266.5889 1.0006 2700
beta6 -103.5617 34.4233 -170.3371 -126.6671 -103.5801 -79.5467 -36.9862 1.0008 2700
beta7 379.2194 102.8712 173.1646 313.1684 384.8905 444.0248 576.8855 1.0011 2700
beta8 10.6834 12.9146 -14.2873 2.2903 10.8980 19.2515 36.4702 1.0010 2700
beta9 36.4856 6.4848 24.0616 32.1903 36.3748 40.6673 49.8435 1.0013 2200
deviance 325.8906 6.4609 318.8481 322.3587 325.0252 328.4061 336.5689 1.0005 2700
The DIC of the model is 346.7765949.

 

We know that \(\beta_j\) measures the marginal impact of the covariate \(X_j\) on the log-odds in favor of \(Y = 1\). So \((e^{\beta_j} - 1) \times 100\) measures the percentage change rate of the odds in favor of \(Y = 1\) when the predictor \(X_j\) varies by one unit.
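For example, applying this formula to the posterior mean of \(\beta_2\) (Area Mean) from the table above:

```python
import math

# Posterior mean of beta2 (Area Mean) from the summary table above
beta2 = 0.0092
pct_change = (math.exp(beta2) - 1) * 100
# a one-unit increase in area_mean raises the odds of malignancy by ~0.92%
```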

 

We have other interesting results:

  • Mean: point estimate for \(\beta_j\).

  • Sd: standard deviation for \(\beta_j\).

  • Rhat: the potential scale reduction factor, a measure of chain convergence; it compares the between-chain variance with the within-chain variance, and if they are similar the chains are drawing from the same distribution. In our case the value of Rhat is always approximately 1.

  • n.eff: the effective sample size, i.e. the number of independent Monte Carlo samples that would give the same precision as the MCMC samples. The greater this value, the lower the autocorrelation between MCMC steps and the better the final approximation, because fewer of the retained draws are correlated with one another.

  • DIC: the Deviance Information Criterion, a generalization of the AIC and a measure of goodness of fit. Lower values are better. These criteria have no absolute scale and should be used only to rank models.

  • deviance: a measure of the goodness of fit of the model at the different steps of the chain. We want it to be fairly stationary in order to use it for the DIC and thus to judge whether a model is good or not.

     

We can also compute credible intervals for the parameters.

Equi-tail intervals:

log odds
2.5% 97.5%
beta1 -0.1639294 0.0102767
beta10 -64.8324810 0.2388648
beta2 0.0073378 0.0113394
beta3 -101.4616525 -60.4595676
beta4 -0.4199928 1.2403322
beta5 -183.2997351 266.5889149
beta6 -170.3371420 -36.9862169
beta7 173.1646178 576.8854976
beta8 -14.2872721 36.4702192
beta9 24.0615947 49.8435244

HPD intervals:

log odds
lower upper
beta1 -0.1580455 0.0142097
beta10 -64.8503339 0.2657437
beta2 0.0073160 0.0112938
beta3 -100.9083440 -60.0598783
beta4 -0.3596252 1.2854830
beta5 -189.1564968 254.8279152
beta6 -168.3599686 -35.7949675
beta7 172.8495948 576.4033554
beta8 -14.3477747 36.4583913
beta9 24.2406558 49.8828444
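The two interval types differ in construction: the equi-tailed interval cuts 2.5% of the posterior draws from each tail, while the HPD (highest posterior density) interval is the shortest interval containing 95% of them. A Python sketch on simulated draws (the report computes these in R, e.g. via coda):

```python
import numpy as np

def equi_tail(draws, level=0.95):
    """Cut (1 - level)/2 probability from each tail."""
    alpha = 1.0 - level
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2])

def hpd(draws, level=0.95):
    """Shortest interval containing `level` of the draws."""
    s = np.sort(draws)
    n_in = int(np.ceil(level * len(s)))
    widths = s[n_in - 1:] - s[: len(s) - n_in + 1]
    i = int(np.argmin(widths))
    return np.array([s[i], s[i + n_in - 1]])

rng = np.random.default_rng(2)
draws = rng.normal(loc=1.0, scale=2.0, size=20_000)  # toy symmetric posterior
et = equi_tail(draws)
hp = hpd(draws)
# for a symmetric posterior the two intervals nearly coincide;
# for skewed posteriors the HPD interval is shorter
```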

 

Summary table of \(\beta\) coefficients
Estimate Std. Error z value Pr(>|z|)
x1 -0.0706 0.0433 -1.6299 0.1031
x2 0.0087 0.0010 8.6328 0.0000
x3 -76.4210 10.3103 -7.4121 0.0000
x4 0.4263 0.4149 1.0277 0.3041
x5 34.7422 115.2116 0.3016 0.7630
x6 -100.1797 33.9490 -2.9509 0.0032
x7 371.9737 92.0941 4.0391 0.0001
x8 10.8326 12.6072 0.8592 0.3902
x9 35.0198 6.3987 5.4730 0.0000
x10 -31.4309 15.9309 -1.9729 0.0485
With the frequentist approach, AIC = 335.4678473. As we can see from this summary, the estimated coefficients obtained with the frequentist approach are very similar to those obtained with JAGS, and the AIC is also similar to the DIC. Some features may not be very informative; looking at the credible intervals and the model diagnostics, and after several attempts with different sets of features, we can exclude some variables and re-estimate a reduced model to compare with the full one.

 

Bayesian Logistic Regression with features selection

We try to eliminate some features and see what happens. The model is defined using RJags, as follows:

## Model 2 (Reduced)
model2 <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta1*x1[i] + beta2*x2[i] +
      beta3*x3[i] + beta6*x6[i] + beta7*x7[i] +
      beta9*x9[i]}
  
  ## Defining the prior beta parameters
  beta1  ~ dnorm(0, 1.0E-6)
  beta2  ~ dnorm(0, 1.0E-6)
  beta3  ~ dnorm(0, 1.0E-6)
  beta6  ~ dnorm(0, 1.0E-6)
  beta7  ~ dnorm(0, 1.0E-6)
  beta9  ~ dnorm(0, 1.0E-6)
}
## Passing the data for RJags
data.jags2 <- list("y" = y, "N" = N, "x1" = x1,
                   "x2" = x2, "x3" = x3,
                   "x6" = x6, "x7" = x7, "x9" = x9)

## Defining parameters of interest to show after running RJags
mod.params2 <- c("beta1", "beta2", "beta3",
                 "beta6", "beta7", "beta9")
## Bayesian Approach
## Run JAGS

## Model 2
mod_jags2 <- jags(data = data.jags2,
                 n.iter = 10000,
                 model.file = model2,      
                 n.chains = 3, 
                 parameters.to.save = mod.params2,
                 n.burnin = 1000, n.thin = 10) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 456
##    Unobserved stochastic nodes: 6
##    Total graph size: 6583
## 
## Initializing model
Running Jags with 3 chains, 10000 iterations, a burn-in of 1000 steps and a thinning of 10 we have:

 

Summary table of \(\beta\) coefficients
mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta1 -0.0540 0.0314 -0.1167 -0.0748 -0.0537 -0.0327 0.0054 1.0032 730
beta2 0.0087 0.0010 0.0069 0.0081 0.0087 0.0093 0.0107 1.0036 640
beta3 -69.6686 8.9123 -86.8519 -75.4442 -69.7594 -63.7351 -52.8326 1.0050 440
beta6 -62.5022 23.2974 -107.6673 -78.6265 -62.3917 -46.9562 -16.4377 1.0006 2700
beta7 244.4337 65.2123 112.2856 200.6755 246.8055 288.9248 367.5216 1.0010 2700
beta9 27.0470 4.2223 19.1164 24.2326 27.0059 29.8305 35.5377 1.0038 600
deviance 327.7616 5.3962 322.8133 324.9668 326.8906 329.6372 336.2201 1.0014 2700
We observe that after removing these features the DIC decreases to 342.3314736.

 

Summary table of \(\beta\) coefficients

Estimate Std. Error z value Pr(>|z|)
x1 -0.0531 0.0317 -1.6759 0.0938
x2 0.0085 0.0009 9.2681 0.0000
x3 -67.9006 8.5548 -7.9371 0.0000
x6 -61.8177 22.4817 -2.7497 0.0060
x7 247.0764 64.7764 3.8143 0.0001
x9 26.2832 4.0435 6.5000 0.0000
With the frequentist approach and feature selection, the AIC value is 333.5691347.

 

Diagnostics

At this point, we take the reduced model as the best one and continue with it. For the diagnostics, we can look at the trace plots of the Markov chains, the density of the values along the chain, and the autocorrelations of each parameter.

Looking at the trace plot, it is possible to observe the trend of the parameter with respect to the iteration of one or more Markov chains. We would like to observe a stationary time series, which would mean that the trend of the parameter, on average, is almost constant in the long run (flat).

We can see that the Markov chain remains in its steady state and that the autocorrelation between consecutive values of the chain is very small. The autocorrelation drops to around zero after a few lags, meaning there is little dependence among different values in the chain. This holds for all the parameters, although for \(\beta_1\) and \(\beta_2\) the autocorrelation decreases more slowly.

 

The Cumulative Means

Here, we want to compute the empirical average \(\mathbf{\hat{I}}_{t}\) for increasing values of \(t \, = \, 1,...,T\), with the following formula:

\[ \mathbf{I} \approx \mathbf{\hat{I}}_{t} = \frac{1}{t} \sum_{i=1}^{t} h(\theta^{(i)}) \]

For each parameter, all three generated chains reach the same end point, which means that even with different initial points the three chains give the same estimated posterior mean.
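The cumulative (running) means are cheap to compute from a chain; a Python sketch with stand-in draws (here a hypothetical chain with roughly the posterior mean and sd reported for \(\beta_9\)):

```python
import numpy as np

rng = np.random.default_rng(3)
chain = rng.normal(loc=27.0, scale=4.2, size=2700)  # stand-in draws, e.g. beta9

# I_hat_t = (1/t) * sum_{i=1..t} h(theta_i), with h the identity here
t = np.arange(1, len(chain) + 1)
running_mean = np.cumsum(chain) / t
# the running mean stabilises around the posterior mean as t grows
```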

 

The Approximation Error

Now, we consider the approximation error of the MCMC sampling, i.e. the Monte Carlo standard error (MCSE), the square root of the variance of the estimator.

The variance formula is: \[ \mathbf{V}[\hat{I}_{t}] = \frac{Var_{\pi}[h(X_{1})]}{t_{eff}} = \Big( 1 + 2 \sum_{k=1}^{\infty} \rho_{k}\Big)\frac{\sigma^{2}}{T} \] The model we specified earlier has the following effective sample size for each parameter:
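The effective sample size appearing in this formula can be estimated from the empirical autocorrelations, truncating the sum once the estimates become noise. A hedged Python sketch (JAGS/coda compute n.eff with their own, more refined estimator):

```python
import random

def effective_sample_size(chain, max_lag=200):
    """ESS = T / (1 + 2 * sum of autocorrelations), truncating the sum
    at the first negative autocorrelation estimate."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    rho_sum = 0.0
    for k in range(1, max_lag + 1):
        rho = sum((chain[i] - mean) * (chain[i + k] - mean)
                  for i in range(n - k)) / (n * var)
        if rho < 0:          # remaining estimates are noise
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

random.seed(2)
chain, x = [], 0.0
for _ in range(6000):
    x = 0.5 * x + random.gauss(0, 1)   # correlated draws, like MCMC
    chain.append(x)
ess = effective_sample_size(chain)
# For an AR(1) chain with coefficient 0.5 the theoretical ESS is T / 3.
```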
n.eff
beta1 730
beta2 640
beta3 440
beta6 2700
beta7 2700
beta9 600
deviance 2700

 

Then we compute the MCSE, i.e. the square root of the variance written above; the results are:
MCSE
beta1 0.0006048
beta2 0.0000185
beta3 0.1716696
beta6 0.4671390
beta7 1.2175976
beta9 0.0807752
deviance 0.1040791
As we can see, \(\beta_7\) has the highest approximation error when considering the chains jointly.
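The MCSE is simply the posterior standard deviation divided by the square root of the (effective) number of draws. A minimal Python sketch on synthetic draws (not the actual chain output):

```python
import math
import random

def mcse(draws, ess=None):
    """Monte Carlo standard error: posterior sd over sqrt(sample size),
    using the effective sample size when the draws are correlated."""
    n = len(draws)
    mean = sum(draws) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in draws) / (n - 1))
    return sd / math.sqrt(ess if ess is not None else n)

random.seed(6)
draws = [random.gauss(0, 1) for _ in range(2700)]   # 2700 kept draws
err = mcse(draws)
# With sd close to 1 and 2700 draws, the MCSE is about 1/sqrt(2700) ~ 0.019;
# with a smaller effective sample size the error grows accordingly.
```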

 

Posterior Uncertainty

Mean Sd Uncertainty
beta1 -0.0540273 0.0313761 0.5807454
beta2 0.0087138 0.0009798 0.1124473
beta3 -69.6685988 8.9123001 0.1279242
beta6 -62.5022100 23.2974092 0.3727454
beta7 244.4336595 65.2122575 0.2667892
beta9 27.0470111 4.2222632 0.1561083
deviance 327.7616204 5.3961565 0.0164637
The uncertainty column is the ratio between the posterior standard deviation and the absolute value of the posterior mean; the highest posterior uncertainty is for \(\beta_1\).
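Assuming the uncertainty column is the ratio sd/|mean| (which is consistent with the values reported above), it can be reproduced directly from the mean and sd columns of the summary table:

```python
# Posterior means and sds copied from the summary table above.
posterior = {
    "beta1": (-0.0540273, 0.0313761),
    "beta2": (0.0087138, 0.0009798),
    "beta3": (-69.6685988, 8.9123001),
    "beta6": (-62.5022100, 23.2974092),
    "beta7": (244.4336595, 65.2122575),
    "beta9": (27.0470111, 4.2222632),
}
# Relative posterior uncertainty: sd over absolute mean.
uncertainty = {k: sd / abs(mean) for k, (mean, sd) in posterior.items()}
# beta1 has the largest ratio (about 0.58): its sd is more than half
# the magnitude of its mean.
```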

 

The Estimated Correlations

In this step, we want to compute the correlations between the values sampled during MCMC. Looking at the correlation matrix, the highest positive correlation is between \(\beta_2\) and \(\beta_9\), and the strongest negative correlation is between \(\beta_9\) and \(\beta_3\).

 

Testing of the Convergences

We want to understand whether these multiple simulated Markov chains have reached convergence and whether the stationary region they explore is valid.

Raftery & Lewis Test

Raftery and Lewis introduced an MCMC diagnostic that estimates the number of iterations needed for a given level of precision in posterior samples:

[[1]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 

You need a sample size of at least 3746 with these values of q, r and s

[[2]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 

You need a sample size of at least 3746 with these values of q, r and s

[[3]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 

You need a sample size of at least 3746 with these values of q, r and s

Each of the three chains needs a sample size of at least 3746 to achieve this precision.

 

Geweke Test

The Geweke diagnostic is based on a test for equality of the means of the first and last parts of a Markov chain (by default the first 10% and the last 50%), using a difference-of-means test to check whether the two parts of the chain come from the same distribution (the null hypothesis). If the two means are equal, the chain has likely converged.

The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error adjusted for autocorrelation.
Z-score chain 1 Z-score chain 2 Z-score chain 3
beta1 -0.0795599 -0.6997935 -0.6453352
beta2 -2.1528704 -0.8336848 -0.3105050
beta3 2.1035146 1.0074784 0.5752448
beta6 1.1653385 1.2320715 -0.3730159
beta7 -1.4590239 -1.2396231 1.6495889
beta9 -1.2410542 -1.1371497 -0.5262548
deviance 1.1065939 1.2856623 1.2016557
As we can see, the majority of the values fall inside the acceptance region, so in almost all cases we accept the equality of the means.
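The Geweke Z-score can be sketched as follows — a simplified Python illustration using naive iid standard errors (coda additionally corrects the standard errors for autocorrelation), on synthetic chains rather than the actual output:

```python
import math
import random

def _mean(xs):
    return sum(xs) / len(xs)

def _var(xs):
    m = _mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def geweke_z(chain, first=0.1, last=0.5):
    """Z-score comparing the means of the first 10% and last 50% of a chain."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1 - last) * n):]
    se = math.sqrt(_var(a) / len(a) + _var(b) / len(b))
    return (_mean(a) - _mean(b)) / se

random.seed(3)
stationary = [random.gauss(0, 1) for _ in range(3000)]
drifting = [random.gauss(i / 1000, 1) for i in range(3000)]
# |z| < 1.96 is consistent with convergence; a drifting chain fails badly.
z_ok, z_bad = geweke_z(stationary), geweke_z(drifting)
```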

 

Gelman & Rubin Test

We know that trace plots provide an informal diagnostic of chain convergence; we can also use the Gelman & Rubin diagnostic, which computes the variability within the chains and compares it to the variability between the chains.

The Potential Scale Reduction Factor is the statistic that tells us about chain convergence: the closer it is to 1, the better the convergence. Values far from 1 indicate that the chains have not yet reached convergence.
Point est. Upper C.I.
beta1 1.0005423 1.002736
beta2 1.0131851 1.049265
beta3 1.0141489 1.050655
beta6 1.0007899 1.003467
beta7 1.0009240 1.004519
beta9 1.0060780 1.024531
deviance 0.9999842 1.001437

All the values are near 1, which indicates that the 10000 iterations are enough for convergence to the stationary distribution.
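The potential scale reduction factor can be sketched with the classic between/within-chain variance decomposition — a minimal Python illustration on synthetic chains (coda's `gelman.diag` adds degrees-of-freedom corrections on top of this):

```python
import random

def gelman_rubin(chains):
    """Potential scale reduction factor for m chains of length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n   # pooled variance estimate
    return (var_hat / W) ** 0.5

random.seed(4)
# Well-mixed chains sample the same distribution; "stuck" chains do not.
mixed = [[random.gauss(0, 1) for _ in range(2000)] for _ in range(3)]
stuck = [[random.gauss(mu, 1) for _ in range(2000)] for mu in (0, 3, 6)]
# R-hat is near 1 when the chains overlap, far above 1 otherwise.
```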

 

Heidelberg and Welch Diagnostic

The Heidelberg and Welch diagnostic computes a test statistic to accept or reject the null hypothesis that the Markov chain comes from a stationary distribution.

The diagnostic consists of two parts:

  • it checks whether the chain is stationary or not;
  • it runs the half-width test on the stationary part of the chain, to check whether the mean estimates are accurate enough.
    Stationarity test Start iteration P-value Halfwidth test Mean Halfwidth
    beta1 passed 1 0.3095 passed -0.0524 0.0022
    beta2 passed 181 0.0532 passed 0.0088 0.0002
    beta3 passed 91 0.1002 passed -70.5577 0.9234
    beta6 passed 1 0.3277 passed -62.8875 1.7414
    beta7 passed 1 0.1981 passed 243.5105 5.4210
    beta9 passed 1 0.0749 passed 27.3031 0.3413
    deviance passed 1 0.1402 passed 327.7822 0.4135
All tests are passed, so we can say that the Markov chain reaches the stationary distribution and that the accuracy of our estimates is good (the intervals are small enough).

 

Credible Interval and Point Estimates

At this point, having checked that the simulations pass the convergence diagnostics, we compute credible intervals and point estimates for our estimated beta values.

Point Estimates:

Point Estimates
beta1 -0.0540273
beta2 0.0087138
beta3 -69.6685988
beta6 -62.5022100
beta7 244.4336595
beta9 27.0470111
deviance 327.7616204

Equi-tail intervals:

log odds
2.5% 97.5%
beta1 -0.1166939 0.0053871
beta2 0.0069013 0.0106727
beta3 -86.8518876 -52.8325545
beta6 -107.6673089 -16.4377104
beta7 112.2856068 367.5216493
beta9 19.1163814 35.5376515

HPD intervals:

log odds
lower upper
beta1 -0.1144907 0.0072405
beta2 0.0067775 0.0105161
beta3 -85.7848650 -52.0747865
beta6 -105.0492906 -14.4295950
beta7 121.7873472 374.0515537
beta9 19.1100975 35.5459352
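Both kinds of interval can be computed directly from the posterior draws: the equi-tail interval cuts \(\alpha/2\) off each tail, while the HPD interval is the narrowest window containing the desired mass. A Python sketch on synthetic draws (coda's `HPDinterval` works analogously):

```python
import random

def equal_tail(samples, level=0.95):
    """Interval cutting (1 - level)/2 probability off each tail."""
    s = sorted(samples)
    n = len(s)
    alpha = (1 - level) / 2
    return s[int(alpha * n)], s[int((1 - alpha) * n) - 1]

def hpd(samples, level=0.95):
    """Narrowest interval containing `level` of the sorted draws."""
    s = sorted(samples)
    n = len(s)
    k = int(level * n)
    widths = [s[i + k] - s[i] for i in range(n - k)]
    i = widths.index(min(widths))
    return s[i], s[i + k]

random.seed(5)
draws = [random.gauss(10, 2) for _ in range(20000)]
lo_et, hi_et = equal_tail(draws)
lo_hpd, hi_hpd = hpd(draws)
# For a symmetric posterior the two intervals nearly coincide (~10 +/- 3.92).
```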

 

Model comparison

Let's compare the models using the DIC (Deviance Information Criterion), defined as follows:

\(DIC = D(\hat\theta) + 2 \, p_D\), where \(D(\hat\theta) = -2 \, \log f(y|\hat\theta)\) is the deviance of the model evaluated at the posterior point estimate and \(p_D\) is the effective number of ("unconstrained") parameters in the model.

Bayesian Approach
DIC
Model 1 (complete) 346.7766
Model 2 (reduced) 342.3315
The model with the best Deviance Information Criterion is the second one.
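R2jags reports \(DIC = \bar{D} + p_D\) with \(p_D\) estimated as half the posterior variance of the deviance. Assuming that rule, the reduced model's DIC can be reproduced from the deviance row of its summary table:

```python
# Mean and sd of the deviance, from the reduced model's summary table.
mean_dev, sd_dev = 327.7616204, 5.3961565
p_d = sd_dev ** 2 / 2        # effective number of parameters (~14.6)
dic = mean_dev + p_d
# dic comes out around 342.32, matching the reported DIC of 342.33
# up to rounding of the tabulated sd.
```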

 

The Predictions

Here, we can observe the predictions of the second model under the Bayesian approach and assess its performance on unseen data using the confusion matrix, a tool for summarizing the performance of a classification method. Starting from this point, we can define different metrics that will be useful later on.

It's important to define the False Positive and False Negative that are the two types of errors. We can also think in terms of hypothesis testing and fix that:

  • Negative (-): Null hypothesis

  • Positive (+): Alternative hypothesis

In this case, if we label a patient with the disease as positive, a false positive is a patient who is actually healthy but is incorrectly classified as sick. This corresponds to the type I error (\(\alpha\)), i.e. rejecting the null hypothesis when it is true.

On the other hand, if a patient is actually sick and is incorrectly classified as not sick, we have a false negative, which corresponds to the type II error (\(\beta\)), i.e. not rejecting the null hypothesis when it is false.

Obviously, there is not always a balance between the two error components, and often we need to find the right trade-off. In this analysis, I think it is more important to minimize false negatives than false positives. A false negative means not treating a person who is actually sick, probably causing them to die; a false positive means treating a patient who does not need it, because they are healthy. The latter problem could be mitigated with an enhanced, more in-depth follow-up analysis of positive cases, so as to be really sure whether a person kept in hospital actually needs treatment.

I think it is useful to follow this approach because, especially with fast-developing diseases such as tumors, classifying a patient as healthy when they are not means postponing intervention to, say, the next annual check-up, by which time it may be too late to intervene with a cure.

At this point, to minimize false negatives we consider sensitivity, the fraction of true positives correctly classified: minimizing false negatives corresponds to maximizing true positives, so if a patient is actually sick (positive) we do not want to misclassify them. Obviously, we also want to keep the fraction of false positives under control, because we don't want to keep too many healthy people in the hospital for further checks, which would mean unnecessary costs and overcrowding of the facility.

For this reason, our first goal is to maximize sensitivity, while also trying to get a good score for specificity. We can balance this trade-off with different threshold values.
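How a classification threshold \(t\) trades sensitivity against specificity can be sketched as follows — a toy Python example with made-up scores, not the model's actual predictions:

```python
def confusion_metrics(probs, labels, t):
    """Sensitivity and specificity when predicting 1 if p >= t."""
    tp = sum(p >= t and y == 1 for p, y in zip(probs, labels))
    fn = sum(p < t and y == 1 for p, y in zip(probs, labels))
    tn = sum(p < t and y == 0 for p, y in zip(probs, labels))
    fp = sum(p >= t and y == 0 for p, y in zip(probs, labels))
    return tp / (tp + fn), tn / (tn + fp)

# Toy scores: sick patients (label 1) tend to get higher probabilities.
probs  = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1,   1,   1,   1,   0,   0,   0,   0]
sens_low,  spec_low  = confusion_metrics(probs, labels, 0.2)
sens_high, spec_high = confusion_metrics(probs, labels, 0.5)
# Raising the threshold trades sensitivity for specificity:
# t = 0.2 gives (1.00, 0.25); t = 0.5 gives (0.75, 0.75).
```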

It is really interesting to note how the value of the threshold can affect the classification error.

For \(t = 0.1\), we have a very low false negative rate, reflected in a \(sensitivity = 0.95\). This would be a very good performance, were it not for the fact that \(specificity < 0.50\): assigning healthy patients to a category at random (probability \(\frac{1}{2}\)) would classify them better than this method.

 

For \(t = 0.2\), we have a very low false negative rate, which is reflected in a \(sensitivity = 0.93\). This is a very good performance, and we have an important improvement for the \(specificity = 0.62\).

 

For \(t = 0.3\), we have a good performance on the false negative rate, with a \(sensitivity = 0.88\). We are also able to manage the false positives, with a \(specificity = 0.65\).

 

For \(t = 0.4\), we have \(sensitivity = 0.86\) and \(specificity = 0.72\).

 

For \(t = 0.5\), we have an interesting performance: \(sensitivity = 0.76\) and \(specificity = 0.76\). Compared to the previous case, we gain four percentage points on the correct classification of healthy patients (specificity rises from 0.72 to 0.76), but we now misclassify twenty-four positive patients out of a hundred (sensitivity drops from 0.86 to 0.76).

 

 

In the last two cases, we see how, as the threshold increases, the sensitivity decreases while the specificity increases, so we cannot consider these last two settings optimal for the goal we want to achieve.

At this point, it is clear that the choice is between \(t = 0.2\) and \(t = 0.3\). We choose \(t = 0.2\), which is the right compromise between the two goals, because it yields fewer false negatives at the price of a small increase in false positives. We will certainly have more false positives, i.e. people who will needlessly wait for additional analyses to confirm their health status, but, as they say in these cases, prevention is better than cure.

 

Receiver Operating Characteristic (ROC Curve)

The ROC curve is a graphical scheme for a binary classifier. At various threshold values it plots the True Positive Rate (TPR, the fraction of true positives, i.e. the sensitivity) against the False Positive Rate (FPR, the fraction of false positives, i.e. 1 - specificity, also known as Fall-Out). It separates the signal from the noise, allowing us to study the relationship between the hit rate on actually positive instances and the false alarms.

Through the analysis of the curve, the discriminatory capacity of the classifier between positive and negative samples is assessed by computing the area under the ROC curve, the Area Under Curve (AUC). The AUC takes values in \([0;1]\) and equals the probability that the classifier scores a randomly drawn positive instance higher than a randomly drawn negative one.

In the following plot, we show specificity (the fraction of correctly classified true negatives) on the x-axis and sensitivity on the y-axis. We can derive these quantities with the following formulas:

\(True\, Positive\, Rate\, (TPR) = \frac{TP}{P} \qquad\qquad True\, Negative\, Rate\, (TNR) = \frac{TN}{N} = 1 - FPR\)
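The probabilistic interpretation of the AUC given above can be computed directly: count, over all positive/negative pairs, how often the positive instance gets the higher score. A Python sketch with toy scores (packages like pROC use the equivalent rank-based formula):

```python
def auc(probs, labels):
    """AUC as P(score of a random positive > score of a random negative),
    counting ties as 1/2."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: one positive (0.4) is outranked by one negative (0.6),
# so 15 of the 16 pairs are ordered correctly.
probs  = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1,   1,   1,   1,   0,   0,   0,   0]
```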

 

Looking at the ROC curve, it might seem obvious to choose \(t = 0.7\), which guarantees an \(AUC = 0.80\). The problem is that this value is driven more by specificity than by sensitivity, which in fact stays below 0.75.

For this reason, we confirm the choice of \(t = 0.2\), with \(AUC = 0.77\), \(sensitivity = 0.93\) and \(specificity = 0.62\).

It is important to remember that the results obtained from this forecast are dependent on the specific split in training and testing done previously; this means that with another split procedure, that is with another training and test set, the performances will be different. For this reason, especially for future studies, a cross-validation method can be used to have a more robust and averaged evaluation of the method used.

Here, we can see a summary of all metrics for the different thresholds:
Sensitivity Specificity Precision F1 AUC
t = 0.1 0.9523810 0.4647887 0.5128205 0.6666667 0.7085848
t = 0.2 0.9285714 0.6197183 0.5909091 0.7222222 0.7741449
t = 0.3 0.8809524 0.6478873 0.5967742 0.7115385 0.7644199
t = 0.4 0.8571429 0.7183099 0.6428571 0.7346939 0.7877264
t = 0.5 0.7619048 0.7605634 0.6530612 0.7032967 0.7612341

 

Second Model

At this point, using the set of features of the reduced model, we want to implement another linear classifier, namely a probit model to study its diagnostics and forecasting capacity.

We have implemented the model using RJags, as follows:

## Model Probit
model_probit <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    p[i] <- phi(beta1*x1[i] + beta2*x2[i] +
      beta3*x3[i] + beta6*x6[i] + beta7*x7[i] +
      beta9*x9[i])}
  
  ## Defining the prior beta parameters
  beta1  ~ dnorm(0, 1.0E-6)
  beta2  ~ dnorm(0, 1.0E-6)
  beta3  ~ dnorm(0, 1.0E-6)
  beta6  ~ dnorm(0, 1.0E-6)
  beta7  ~ dnorm(0, 1.0E-6)
  beta9  ~ dnorm(0, 1.0E-6)
}
## Passing the data for RJags
data.jags_probit <- list("y" = y, "N" = N, "x1" = x1,
                         "x2" = x2, "x3" = x3, "x6" = x6,
                         "x7" = x7, "x9" = x9)
## Defining parameters of interest to show after running RJags
mod.params_probit <- c("beta1", "beta2", "beta3",
                       "beta6", "beta7", "beta9")
## Bayesian Approach
## Run JAGS

## Model Probit
## Bayesian Approach
mod_jags_probit <- jags(data = data.jags_probit,
                        n.iter = 10000,
                        model.file = model_probit,
                        n.chains = 3,
             parameters.to.save = mod.params_probit,
                        n.burnin = 1000, n.thin = 10) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 456
##    Unobserved stochastic nodes: 6
##    Total graph size: 6583
## 
## Initializing model

Summary table of \(\beta\) coefficients:

mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta1 -0.0292 0.0182 -0.0652 -0.0416 -0.0292 -0.0169 0.0054 1.0015 1900
beta2 0.0049 0.0005 0.0040 0.0046 0.0049 0.0052 0.0059 1.0067 320
beta3 -39.2792 4.7924 -48.9372 -42.4228 -39.1972 -36.1112 -30.2259 1.0013 2300
beta6 -37.2654 12.3441 -61.7162 -45.6099 -37.3259 -29.1238 -12.8072 1.0009 2700
beta7 140.8234 36.9233 67.0920 117.0872 141.5007 165.2947 212.4105 1.0061 850
beta9 15.2782 2.2708 11.1024 13.7799 15.1856 16.7505 19.8330 1.0006 2700
deviance 327.3179 5.0265 322.4518 324.6194 326.5371 329.0299 336.1120 1.0007 2700
The DIC of the model is 339.96.

 

The Predictions

For the probit model too, we determine the optimal value of \(t\) by looking at the confusion matrix and the ROC curve.

 

Receiver Operating Characteristic (ROC Curve)

For the probit model, the correct choice would seem to be \(t = 0.3\), with \(sensitivity = 0.93\) and \(specificity = 0.55\).

Here, we can see a summary of all metrics for the different thresholds:
Sensitivity Specificity Precision F1 AUC
t = 0.1 1.0000000 0.1830986 0.4200000 0.5915493 0.5915493
t = 0.2 0.9523810 0.4084507 0.4878049 0.6451613 0.6804158
t = 0.3 0.9285714 0.5492958 0.5492958 0.6902655 0.7389336
t = 0.4 0.8809524 0.6619718 0.6065574 0.7184466 0.7714621
t = 0.5 0.7142857 0.7183099 0.6000000 0.6521739 0.7162978

 

As we can see, the model diagnostics tell us that the parameters reach convergence, as in the case of the logistic model. The DIC values are also quite similar to each other.

Bayesian Approach
DIC
Model Logit (complete) 346.7766
Model Logit (reduced) 342.3315
Model Probit (reduced) 339.9573
We recall that \(\beta_{logit} \approx 1.6 \times \beta_{probit}\): the table below reports the probit estimates rescaled by 1.6, and they are indeed very similar to the ones found with the logistic model.
Logit Probit
beta1 -0.0540273 -0.0466982
beta2 0.0087138 0.0078528
beta3 -69.6685988 -62.8467892
beta6 -62.5022100 -59.6246831
beta7 244.4336595 225.3173729
beta9 27.0470111 24.4450615
deviance 327.7616204 523.7086160
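The rule of thumb \(\beta_{logit} \approx 1.6 \times \beta_{probit}\) comes from the fact that the logistic link is well approximated by the probit link with its argument scaled by about 1.6. A small Python check of the two link functions (illustration only, independent of the fitted models):

```python
import math

def logistic(x):
    """Inverse logit link."""
    return 1 / (1 + math.exp(-x))

def probit(x):
    """Standard normal CDF (inverse probit link), via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Maximum gap between logistic(1.6 * z) and probit(z) over z in [-4, 4]:
# the two curves differ by less than ~0.02 everywhere, which is why the
# fitted coefficients differ by roughly that constant factor.
max_gap = max(abs(logistic(1.6 * z) - probit(z))
              for z in [i / 100 for i in range(-400, 401)])
```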

 

Recover model parameters with data simulated from the model

We have seen that, in terms of predictions, the reduced logit is better than the reduced probit. So, to check that the proposed model can correctly recover its parameters, we simulate data from the model itself, using the beta estimates of the reduced logit as the true values, and then refit:

We have implemented the model using RJags, as follows:

## Model Logit (Simulated)
model_simulated <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta1_sim*x1_simulation[i] +
      beta2_sim*x2_simulation[i] +
      beta3_sim*x3_simulation[i] +
      beta6_sim*x6_simulation[i] +
      beta7_sim*x7_simulation[i] +
      beta9_sim*x9_simulation[i]}
  
  ## Defining the prior beta parameters
  beta1_sim  ~ dnorm(0, 1.0E-6)
  beta2_sim  ~ dnorm(0, 1.0E-6)
  beta3_sim  ~ dnorm(0, 1.0E-6)
  beta6_sim  ~ dnorm(0, 1.0E-6)
  beta7_sim  ~ dnorm(0, 1.0E-6)
  beta9_sim  ~ dnorm(0, 1.0E-6)
}
## Passing the data for RJags
data.jags_sim <- list("y" = y_simulation,
                      "N" = N,
                      "x1_simulation" = x1_simulation,
                      "x2_simulation" = x2_simulation,
                      "x3_simulation" = x3_simulation,
                      "x6_simulation" = x6_simulation,
                      "x7_simulation" = x7_simulation,
                      "x9_simulation" = x9_simulation)
## Defining parameters of interest to show after running RJags
mod.params_sim <- c("beta1_sim", "beta2_sim", "beta3_sim",
                    "beta6_sim", "beta7_sim", "beta9_sim")
## Bayesian Approach
## Run JAGS

## Model Logit Simulated
mod_jags_sim <- jags(data = data.jags_sim,
                 n.iter = 20000,
                 model.file = model_simulated,      
                 n.chains = 3, 
                 parameters.to.save = mod.params_sim,
                 n.burnin = 1000, n.thin = 20) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 569
##    Unobserved stochastic nodes: 6
##    Total graph size: 6958
## 
## Initializing model
Running JAGS with 3 chains, 20000 iterations, a burn-in of 1000 steps and a thinning of 20, we obtain:

 

As we can see, the parameters are similar to those of the reduced logit, so the model is able to recover the model parameters from the simulated data.

 

Summary table of \(\beta\) coefficients:

mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta1_sim -0.0037 0.0323 -0.0650 -0.0261 -0.0037 0.0181 0.0586 1.0024 1100
beta2_sim 0.0085 0.0009 0.0068 0.0079 0.0085 0.0091 0.0103 1.0102 240
beta3_sim -84.7606 8.6898 -102.3195 -90.2722 -84.5709 -79.0609 -68.8216 1.0097 240
beta6_sim -99.7275 21.9301 -143.5143 -114.8321 -99.2103 -84.6228 -58.5241 1.0052 430
beta7_sim 269.2955 62.6820 148.3840 227.2848 268.7020 310.3443 394.2221 1.0014 2800
beta9_sim 35.1551 4.0926 27.8540 32.4818 35.0112 37.7974 43.3874 1.0040 570
deviance 310.9338 8.1653 305.7895 308.0786 309.9588 312.6673 319.5757 1.0025 1600
The DIC of the model is 344.27.

 

Diagnostic

 

Final Discussion

After this analysis, the importance of understanding the problem at hand, rather than simply trying different methods to chase the best performance, is clear. In fact, even a simple logistic regression achieves good metric values.

Furthermore, it is also important to understand that it will be difficult to obtain a model that maximizes all the metrics in question, so we need to define the goal to be achieved and look at the metrics that interest us.

 

Future Work

Of course, we can try to improve performance in several ways:

  • resampling methods to balance the dataset.
  • consider a more complex model, maybe not a linear one.
  • use a set of different features or different priors for the parameters.
  • use cross-validation.